Abstract

Background: Structural variations (SVs), such as insertions, deletions, inversions, and duplications, are a common 
feature in human genomes, and a number of studies have reported that such SVs are associated with human 
diseases. Although the progress of next generation sequencing (NGS) technologies has led to the discovery of a 
large number of SVs, accurate and genome-wide detection of SVs remains challenging. Thus far, various calling 
algorithms based on NGS data have been proposed. However, their strategies are diverse and there is no tool able 
to detect a full range of SVs accurately. 

Results: We focused on evaluating the performance of existing deletion calling algorithms for various spanning 
ranges from low- to high-coverage simulation data. The simulation data was generated from a whole genome 
sequence with artificial SVs constructed based on the distribution of variants obtained from the 1000 Genomes 
Project. From the simulation analysis, deletion calls of various sizes were obtained with each caller, and
performance differed considerably depending on the type of algorithm and the targeted deletion
size. Based on these results, we propose an integrated structural variant calling pipeline (iSVP) that combines
existing methods with newly devised filtering and merging processes. It achieved highly accurate deletion calling
with >90% precision and >90% recall on the 30x read data for a broad range of sizes. We applied iSVP to the
whole-genome sequence data of a CEU HapMap sample, and detected a large number of deletions, including 
notable peaks around 300 bp and 6,000 bp, which corresponded to Alus and long interspersed nuclear elements, 
respectively. In addition, many of the predicted deletions were highly consistent with experimentally validated 
ones by other studies. 

Conclusions: We present iSVP, a new deletion calling pipeline to obtain a genome-wide landscape of deletions in
a highly accurate manner. From simulation and real data analysis, we show that iSVP is broadly applicable to
human whole-genome sequencing data, which will elucidate relationships between SVs across genomes and 
associated diseases or biological functions. 



Background 

Structural variation (SV) is one of the key features of 
genetic variations among individuals. SV includes several 
types of sequence-level polymorphisms such as insertions, 
deletions, inversions, translocations, and duplications or 
copy number variations. A number of studies have
implicated relationships between such SVs and human
phenotypes including diseases such as cancer susceptibil- 
ity [1], mental disorders [2], metabolic disorders [3], and 
some types of intractable diseases [4-6]. 

While most single nucleotide polymorphisms (SNPs) 
are di-allelic and easier to detect, many SVs are multi- 
allelic in general and their patterns vary significantly 
among different SV types [7]. Consequently, the detec- 
tion of SVs is much more difficult than that of SNPs. 
Large-scale genomic SVs have conventionally been 
investigated by Southern blot analysis. In later years,
fluorescence in situ hybridization to DNA fibers (Fiber- 
FISH), which is based on the hybridization of fluores- 
cent probes onto chromosomes, has been widely used 
for the detection of SVs [8,9]. Such large genomic dele- 
tions in specific chromosomal regions have been 
reported to be associated with severe neuropathy and 
neurocognitive deficits [10,11]. The development of 
microarray technologies, such as array comparative 
genomic hybridization (array-CGH) and whole-genome 
SNP genotyping technologies, has enhanced the study of 
human SVs at the genome-wide level by detecting gains 
and losses of DNA regions compared to the reference 
genome [12-14]. A high-resolution statistical method to
detect SVs with a hidden Markov model from Illumina
high-density SNP genotyping data has been proposed 
[15]. However, there are limitations to array-based 
methods for SV detection. First, because the SNP probes 
of these arrays do not uniformly represent SVs distribu- 
ted across the whole genome, some SVs outside the tar- 
geted region might not be detected at all. Second, the 
arrays can only detect SVs of relatively large sizes cover- 
ing more than several kilobases. Third, they cannot 
detect the precise breakpoints of the SVs. Finally, novel 
insertions cannot be detected since they are not pre- 
included in array probes. 

Recent progress in NGS technologies has enabled us
to detect SVs more directly. Several types
of computational methods based on NGS data have 
been proposed for finding SVs with higher resolution 
than SNP array-based methods. In these analyses, typi- 
cally 35-100 bp paired-end reads are mapped to the 
reference genome, and SVs are inferred from the status 
of the mapped reads. The first approach, called read 
depth (RD), utilizes the depth of coverage of mapped 
reads [16]. Essentially, lower and higher depth values 
imply deletions and duplications of the region, respec- 
tively. The second approach, read pair (RP), uses anom- 
alous paired-end mappings of reads [17]. According to 
the separation distance and read orientation, SVs can be 
inferred. The third strategy, split read (SR), evaluates 
partial mapping of reads; this is employed by Pindel [18] 
and ClipCrop [19]. Pindel uses a portion of paired end 
reads in which one of the pair is unmapped. On the 
other hand, ClipCrop uses 'soft-clipped' reads, in which 
a part of the read maps to the reference genome and 
the other does not. The soft-clip information can be 
obtained from the mapped result encoded in the 
Sequence Alignment/Map (SAM) format [20]. The 
fourth approach, sequence assembly (AS), assembles 
novel sequences from short reads locally. However, there
currently appear to be only a few integrative tools that
detect SVs of all types and sizes, and
the characteristics and performance of these various



tools have not yet been extensively studied. Recently, 
whole-genome sequencing data of many individuals 
have been produced very rapidly, as in the 1000 Gen- 
omes Project [21]. Hence, the development of a reliable 
and robust SV detection method from whole-genome 
sequencing data is urgently needed. 
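
As an illustration of the soft-clip signal used by split-read methods, the following minimal Python sketch extracts the lengths of the leading and trailing soft clips from a SAM CIGAR string. The function name `soft_clip_lengths` and the example CIGAR strings are ours and chosen only for illustration; this is not a description of how ClipCrop or Pindel is implemented.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths for a SAM CIGAR string.

    Hard clips (H) may surround soft clips, so they are ignored here.
    An unavailable CIGAR ("*") yields (0, 0).
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    core = [(n, op) for n, op in ops if op != "H"]
    if not core:
        return (0, 0)
    leading = core[0][0] if core[0][1] == "S" else 0
    # A CIGAR consisting of a single S operation is counted once, as leading.
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return (leading, trailing)


# A 100-bp read whose last 40 bases do not align: a candidate breakpoint
# lies at the end of the 60 aligned bases.
print(soft_clip_lengths("60M40S"))  # (0, 40)
print(soft_clip_lengths("100M"))    # (0, 0)
```

Reads with a long soft clip on one side are the ones whose unaligned part can be re-examined to locate a breakpoint at single-base resolution.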

First, we evaluated the performance of several SV 
detection tools with simulated paired-end sequencing 
data for various deletion sizes. Based on the evaluation, 
it would be possible to gain >90% precision and recall 
for a broad range of deletion sizes by combining differ- 
ent types of algorithms in a straightforward way. How- 
ever, deletions detected by different callers often contain 
multiple entries for the same deletions, and these entries 
may have differences in their sizes, positions, and
reliabilities. Thus, a naive combination of multiple callers'
results may fail to produce accurate detection calls.
We propose the integrated Structural Variant calling 
Pipeline (iSVP), which combines existing SV detection 
methods and resolves this problem by selecting a reli- 
able subset of deletion calls and unifying duplicated 
entries. A tool based on a similar concept has been pro- 
posed, named SVMerge [22]. The tool also combines SV 
detected results from multiple callers and generates 
non-redundant calls like iSVP. The tool handles SVs 
other than deletions, but the size of the results is 
restricted to >100 bp. iSVP handles smaller deletions 
consistently and our procedure in the merging step does 
not depend on deletion size (see Methods section). In 
addition, the parameters employed in filtering and mer- 
ging steps of iSVP are determined by evaluating simula- 
tion data for a wide range of sizes. 

We also investigate the relationship between depth of 
coverage and SV detection performance with our pipeline,
and show that high-coverage sequencing (more
than 20x) is necessary to obtain good performance in
SV detection in the simulation experiment. Finally, we
apply our proposed pipeline to whole-genome sequencing
data obtained from the NA12878 sample with an
average depth of 45x and present a comprehensive picture
of deletion events in which the resolution ranges
from 1 bp to more than 100,000 bp. We also confirm 
that some of the predicted deletions with our pipeline 
have been validated in several independent experiments 
[23-25] and that its performance is equivalent to or bet- 
ter than that of tools used independently for all the 
datasets. 

Methods 

Evaluation of SV detection algorithms 

We first compare and evaluate the performance of exist- 
ing deletion callers from the synthetically generated 
NGS read data with various ranges of deletion size and 
depth of coverage. Typically, SV detection algorithms 

Table 1 Summary of SV detection tools.

Tool    Algorithm   Detectable SV types      Simulation 30x            NA12878 45x
                                             (CPU time, max. memory)   (CPU time, max. memory)
BD      RP          DEL, INS, INV, TRA       1.6h, 0.3Gb               2.1h, 0.5Gb
Delly   RP, RD, SR  DEL, INV, DUP, TRA       1.3h, 0.5Gb               24h, 9Gb
Pindel  SR          DEL, INS, INV, DUP, TRA  19h, 3Gb                  37h, 3Gb
HC      AS          DEL, INS, other          68h, *9Gb                 180h, *9Gb

SV detection tools are summarized according to algorithm types, detectable SV types, and computational resources required in our analyses. For each tool, CPU
time and maximum memory size were measured for the 30x simulation data and the 45x whole genome sequence data of NA12878. RP, RD, SR, and AS stand
for read pair, read depth, split read, and assembly approaches, respectively. BD, BreakDancer; HC, Haplotype Caller; DEL, deletion; INS, insertion; INV,
inversion; DUP, duplication; TRA, translocation.
* HC was performed with explicitly specified maximum memory size.



are classified into the following types: read depth (RD), 
read pair (RP), split read (SR), assembly (AS), and com- 
binations of those algorithms. Table 1 summarizes the 
characteristics of deletion callers used in our compari- 
son. BreakDancer (BD) [17] is classified as an RP-type 
tool that uses discordant read pairs (a pair of reads that 
are not properly aligned) to detect SVs. This method 
uses a distribution of fragment lengths of paired-end 
reads to find anomalous read pairs. Its computational
cost is much lower than that of other algorithms, and
hence it is easily applicable to finding large deletions.
Pindel [18] is an SR-type tool that uses the subset of paired-end
reads in which one of the pair is unmapped. It splits
each unmapped read and determines the break points of
SVs using an algorithm called the pattern growth approach.
Delly [26] uses a combination of RP, RD, and SR 
approaches. GATK Haplotype Caller (HC) [27,28] is an 
AS-type method that performs local de novo assembly 
of haplotypes via de Bruijn graphs to detect SNPs and 
indels at base-pair resolution. However, the method 
needs a large amount of computational resources in 
terms of both memory space and CPU time. In our 
computational analysis, we used BD Max version 1.1, 
Delly version 0.0.9, Pindel version 0.2.4, and HC of 
GATK version 2.5-2. 
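
The read-pair idea described above can be sketched in a few lines of Python: a pair whose mapped insert size is far larger than expected from the library's fragment-length distribution suggests a deletion between the two reads. The cutoff of three standard deviations (`n_sd`) and both function names are assumptions made for this illustration and are not BreakDancer's actual parameters or code.

```python
from statistics import mean


def deletion_like_pairs(insert_sizes, library_mean, library_sd, n_sd=3.0):
    """Return indices of read pairs whose mapped insert size is anomalously large.

    n_sd is a hypothetical cutoff chosen for illustration.
    """
    upper = library_mean + n_sd * library_sd
    return [i for i, size in enumerate(insert_sizes) if size > upper]


def estimate_deletion_size(anomalous_sizes, library_mean):
    """Estimate deletion size as the mean excess of insert size over the library mean."""
    return mean(anomalous_sizes) - library_mean


# Mapped insert sizes (bp) of pairs near one locus; library mean 350, s.d. 50.
sizes = [340, 365, 1360, 1345, 310, 1352]
flagged = deletion_like_pairs(sizes, library_mean=350, library_sd=50)
estimate = estimate_deletion_size([sizes[i] for i in flagged], library_mean=350)
print(flagged, round(estimate))
```

In this toy example three pairs are flagged and the estimated deletion size is about 1,000 bp. The resolution of the estimate is limited by the spread of the fragment lengths, which is why read-pair callers are better suited to large deletions.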

Simulation data preparation 

We prepared an artificial human genome sequence by 
adding SVs, insertions, and deletions to the reference 
genome hg19 at randomly selected regions. The size of
the SVs follows the size distribution shown in the histo- 
gram at the bottom-right corner of Figure 1, which was 
constructed based on SV calling results from sequencing 
data in the 1000 Genomes Project [21,29]. We then 
synthetically generated 100-bp paired-end reads from the 
genome sequence and prepared a set of simulated 
sequence data with average depths of 5x, 10x, 20x, and
30x. The insert size of paired-end reads was set to follow 
a normal distribution with a mean of 350 and a standard 
deviation of 50, and a 0.1% substitution error was consid- 
ered at each nucleotide position. For paired-end mapping 
of the simulated data, we used Burrows-Wheeler Aligner 



(BWA) [20] with the default options. The resultant SAM 
file was then used in subsequent SV callers. 
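
The read generation described above can be illustrated with a toy generator. This is a minimal sketch under our own assumptions, not the simulator used in this study: the "insert size" is taken to be the full fragment length, read 2 is the reverse complement of the fragment's far end, and all function names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def add_substitution_errors(read, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    bases = []
    for base in read:
        if rng.random() < rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    return "".join(bases)


def simulate_pairs(genome, depth, read_len=100, insert_mean=350, insert_sd=50,
                   error_rate=0.001, seed=0):
    """Generate paired-end reads covering `genome` at roughly `depth`x."""
    rng = random.Random(seed)
    # Each pair contributes 2 * read_len sequenced bases.
    n_pairs = int(depth * len(genome) / (2 * read_len))
    pairs = []
    while len(pairs) < n_pairs:
        insert = int(round(rng.gauss(insert_mean, insert_sd)))
        if insert < read_len or insert > len(genome):
            continue  # skip fragments that cannot hold a full read
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        read1 = add_substitution_errors(fragment[:read_len], error_rate, rng)
        read2 = add_substitution_errors(
            reverse_complement(fragment[-read_len:]), error_rate, rng)
        pairs.append((read1, read2))
    return pairs


# Toy example: a random 5-kb "reference" and a donor genome with a 300-bp deletion.
rng = random.Random(1)
reference = "".join(rng.choice("ACGT") for _ in range(5000))
donor = reference[:2000] + reference[2300:]
pairs = simulate_pairs(donor, depth=30)
print(len(pairs), "read pairs of length", len(pairs[0][0]))
```

Reads are drawn from the donor genome and would then be mapped back to the unmodified reference, so that pairs spanning the deleted segment appear with an inflated insert size or as split alignments.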

Evaluation metrics for deletion calls 

We defined the precision and recall of deletion calls for
a given size s as follows:

\[
\mathrm{precision}(s) = \frac{\sum_{i \in \{\text{all called SVs with size} = s\}} \max_{j \in \{\text{all prepared SVs}\}} q_{ij}}{N(\text{called size} = s)},
\]

\[
\mathrm{recall}(s) = \frac{\sum_{j \in \{\text{all prepared SVs with size} = s\}} \max_{i \in \{\text{all called SVs}\}} q_{ij}}{N(\text{prepared size} = s)},
\]

where N is the number of called or prepared SVs and
$q_{ij}$ is a quality value that is defined for each overlap
between prepared SVs and called SVs, and takes a value
between 0 and 1. The quality $q_{ij}$ is defined as:

\[
q_{ij} = \frac{\mathrm{size}(a_i \cap b_j)}{\mathrm{size}(a_i \cup b_j)},
\]

where $a_i$ and $b_j$ are the effective regions of called and
prepared SVs, respectively. The effective region is
extended from the actual region by a fixed length mar- 
gin as shown in Figure 2. The margin is introduced in 
order to retain SV calls that were correct but slightly 
deviated from the actual SV region due to ambiguity of 
mapping to the reference genome, often observed at 
interspersed repeats and low-complexity regions. We 
used 50 bp for the margin in our analysis; with this margin,
a 1-bp deletion call whose position deviates by
10 bp from the prepared deletion has a quality of about 0.8. The
difference in quality score arising from the introduction of
the margin converges to 0 as the SV size becomes larger.
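
The quality score and the size-specific precision and recall can be written compactly in Python. The sketch below assumes half-open (start, end) coordinates on a single chromosome; the function names are ours. It reproduces the worked example in the text: a 1-bp deletion call shifted by 10 bp has effective regions of 101 bp that share 91 bp and span 111 bp, giving a quality of 91/111, or about 0.82.

```python
def effective(region, margin=50):
    """Extend a half-open (start, end) region by `margin` bp on both sides."""
    start, end = region
    return (start - margin, end + margin)


def quality(called, prepared, margin=50):
    """Quality q: size of intersection over size of union of the effective regions."""
    a_start, a_end = effective(called, margin)
    b_start, b_end = effective(prepared, margin)
    intersection = min(a_end, b_end) - max(a_start, b_start)
    if intersection <= 0:
        return 0.0
    union = max(a_end, b_end) - min(a_start, b_start)
    return intersection / union


def mean_best_quality(queries, targets, margin=50):
    """Average, over `queries`, of the best quality against any region in `targets`."""
    if not queries:
        return 0.0
    best = [max((quality(q, t, margin) for t in targets), default=0.0)
            for q in queries]
    return sum(best) / len(best)


def size(region):
    return region[1] - region[0]


def precision(called, prepared, s):
    return mean_best_quality([c for c in called if size(c) == s], prepared)


def recall(called, prepared, s):
    return mean_best_quality([p for p in prepared if size(p) == s], called)


# Worked example from the text: 1-bp deletion, call shifted by 10 bp.
print(round(quality((110, 111), (100, 101)), 2))  # 0.82
```

Because the quality is symmetric in its two arguments, the same helper serves both measures; only the set that is averaged over changes.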

The proposed pipeline for calling deletions 

As shown in Figure 3, iSVP consists of three steps: 1) 
SV calling, 2) filtering, and 3) merging. In the SV calling 
step, we employ selected tools with different algorithms 
in parallel to detect a whole range of deletion sizes. 
Next, in the filtering step, we extract information such 
as the SV type, called position, and size from each call- 
er's output. We only utilize deletion calls whose size is 
within a predefined range that is determined to keep 
precision better than 90% from simulation data analysis 

Figure 1 Deletion calling performance. The top-left, top-right, bottom-left, and bottom-right panels show precisions, recalls, F-measures, and
the numbers of deletion calls, respectively, using simulation data with an average read depth of 30x. The bottom-right panel also shows the
histogram of prepared deletions for each size.

Figure 2 An evaluation metric for deletion calls. For the evaluation of SV calling performance, a quality score was defined for each deletion
call for overlaps of called regions. See Methods section for details.

Figure 3 iSVP for deletion calling. In iSVP, each SV caller is first executed in parallel with a given BAM file, and then the results of the callers
are filtered and converted in successive filtering processes. Finally, these results are merged into a unified list of deletion calls in the BED format.
The parameters described in the filtering process are determined by the evaluation of simulation data. AS, SR, and RP stand for assembly, split
read, and read pair approaches, respectively.



(the parameters employed are described in Figure 3). In 
the merging step, we first convert results from each 
caller to an extended BED format, which is convenient 
to compare overlaps of calls. In order to remove dupli- 
cations, we remove one of the SV candidates whose pre- 
cision is lower than the other if they overlapped each 
other by more than two thirds of their called regions. 
The precision for each call is determined by called size 
based on simulation results, and the detection of over- 
laps between calls is performed using BEDTools [30]. 
Finally, we merge the results into a unified SV call list. 
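
The filtering and merging steps can be sketched as follows. Several things here are assumptions made for illustration only: the size ranges in `SIZE_RANGES` are placeholders (the values actually used are those given in Figure 3), the per-call `precision` values are invented, and the two-thirds criterion is read as a reciprocal overlap, i.e., the overlap must exceed two thirds of both regions. iSVP itself detects overlaps with BEDTools; the pure-Python comparison below is only meant to make the logic explicit.

```python
# Hypothetical size ranges (bp) per caller; the actual values are those in Figure 3.
SIZE_RANGES = {"HC": (1, 100), "Pindel": (100, 1000), "BD": (1000, 1_000_000)}


def filter_calls(calls, size_ranges=SIZE_RANGES):
    """Keep calls whose size lies inside the accepted range of their caller."""
    kept = []
    for call in calls:
        low, high = size_ranges[call["caller"]]
        if low <= call["end"] - call["start"] < high:
            kept.append(call)
    return kept


def overlaps_two_thirds(a, b):
    """True if a and b lie on the same chromosome and their overlap exceeds
    two thirds of the length of both regions (our reading of the criterion)."""
    if a["chrom"] != b["chrom"]:
        return False
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    return (overlap > 2 / 3 * (a["end"] - a["start"])
            and overlap > 2 / 3 * (b["end"] - b["start"]))


def merge_calls(calls):
    """Drop the lower-precision call of every pair of duplicated calls."""
    merged = []
    for call in sorted(calls, key=lambda c: c["precision"], reverse=True):
        if not any(overlaps_two_thirds(call, kept) for kept in merged):
            merged.append(call)
    return sorted(merged, key=lambda c: (c["chrom"], c["start"]))


calls = [
    {"caller": "HC", "chrom": "chr1", "start": 200, "end": 205, "precision": 0.97},
    {"caller": "Pindel", "chrom": "chr1", "start": 300, "end": 320, "precision": 0.95},
    {"caller": "Pindel", "chrom": "chr1", "start": 5000, "end": 5950, "precision": 0.93},
    {"caller": "BD", "chrom": "chr1", "start": 4980, "end": 6030, "precision": 0.91},
]
unified = merge_calls(filter_calls(calls))
for call in unified:
    print(call["chrom"], call["start"], call["end"], call["caller"])
```

In this toy input, the 20-bp Pindel call is removed by the size filter, and the BreakDancer call is discarded in the merging step because it duplicates the higher-precision Pindel call at the same locus.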

Results 

Simulation data analysis 

We evaluated the performance of each tool in detecting 
deletions in terms of precision, recall, and their harmonic 
mean (F-measure) with simulation data of varying dele- 
tion sizes and read coverages. The evaluation results with 
each tool for read coverage 30x are summarized in Figure
1. Notably, HC predicted deletions highly accurately with 
precision >90% for sizes <100 bp. Since the method is an 
AS approach, this result suggests that the local de novo 
assembly algorithm around deleted regions was successful 
for relatively short deletion sizes. Pindel performed better 
than other methods in terms of precision for deletion 
sizes between 100 bp and 30,000 bp, retaining >90% pre- 
cision and >90% recall. This result suggests that an SR 
approach, in which split reads were used for identifying 
breakpoints of deletions, was effective for identifying 
medium-size deletions. For deletion sizes >1,000 bp, BD 
and Delly performed comparably well, with precision
>90% and recall >90%. This similarity may be
explained by the fact that they employ similar
computational algorithms (read pair approach). The recall
of Delly was better than that of BD in our analysis. 

Based on the evaluation of deletion calls with each 
tool for simulation data, we determined the ranges of 



deletion size used in the filtering step of iSVP (see 
Figure 3). We used BD, Pindel, and HC in the SV call- 
ing step of iSVP. Although the recall of Delly was better 
than that of BD for simulation results (see Figure 1), we 
used BD because the method showed slightly better pre- 
cision in longer SV regions. As we will discuss in the 
section on computational resources, HC needs more 
central processing unit (CPU) time and memory space 
than Pindel, and Pindel can also detect deletion sizes 
<100 bp with high precision and recall (see Figure 1). 
However, HC has even more precise calls in the region, 
and also determines the ploidy of each call, which is not 
estimated by Pindel. 

We confirmed that iSVP succeeded in achieving >90%
precision and recall for almost all sizes of deletions
when the average depth of coverage was 20x or 30x,
as shown in Figure 4. We also found that it was hard to
achieve precision and recall >90% at the same time for
sequence data with average coverages lower than 10x.
The results showed that a higher depth of coverage was
consistently beneficial for almost all deletion sizes. Therefore,
high-coverage sequencing data are essential for
detecting deletions accurately and comprehensively.

Real data analysis 

We obtained the whole genome sequence of HapMap 
sample NA12878 from Illumina HiSeq 2000. The
100-bp paired-end data with an average depth of 45x was
kindly provided by Illumina Inc. We applied iSVP to the
NA12878 data and predicted a total of 398,518 deletions 
whose size ranged from 1 bp to 1,000,000 bp. The histo- 
gram of predicted deletion calls with iSVP is shown in 
Figure 5. It should be noted that the number of dele- 
tions exponentially declined with increasing deletion 
size. In addition, notable peaks around 300 bp and 6,000 
bp were found, which correspond to Alus and long 
interspersed nuclear elements (LINEs), respectively. 

Figure 4 Comparing deletion calling performance by varying depths of coverage. The top-left, top-right, and bottom-left panels show the
precision, recall, and F-measure of deletion calls, respectively, for simulation data with average depths of 5x, 10x, 20x, and 30x.



These peaks were also found and reported in the 1000 
Genomes Project [29]. 

In order to evaluate the prediction results of iSVP in 
comparison to those of other methods, we compared 
these results with those of the experimentally validated 
deletion sets from studies by Mills [25], Conrad [23], 
and Kidd [24]. The number of predictions with each 
method that was also validated by these studies is 
shown in Table 2. Here, we defined true positive calls if 
their quality scores (see Methods section) were more 
than 0.9. The typical deletion sizes in the Mills, Conrad, 
and Kidd data were around 300 bp, 5,000 bp, and 5,000 



bp, respectively. HC could not find most of the validated 
deletions because the AS algorithm by nature has diffi- 
culty finding relatively long deletions. iSVP and Pindel 
performed well with the Mills dataset, compared to BD 
and Delly. On the other hand, iSVP, BD, and Delly per- 
formed better than Pindel with the Conrad and Kidd 
datasets, as expected. 

Although the numbers of true positives obtained using 
Delly for the validated sets were close to those obtained 
using BD, the number of deletion calls with sizes >50 
bp was significantly larger than that seen with BD, as 
shown in Table 2. This indicates that excessive numbers 

Figure 5 Predicted deletions in the NA12878 sample. A histogram of predicted deletions with iSVP using NA12878 whole-genome
sequence data (45x) of 100-bp paired-end reads. The
number of deletions exponentially decreased with deletion size. In
addition, notable peaks around 300 bp and 6,000 bp were
observed, which correspond to Alu and LINE elements, respectively.



of false positives might have been called with Delly. We
examined the deletions called only by Delly and found
that they consisted of calls supported by only a few (2 or 3)
reads, or by reads of low mapping quality. As expected
from the simulation data analysis, iSVP performed as well as or better than
BD, Pindel, and HC for all the datasets, indicating that
our approach is effective and robust for deletion calling
from real data.
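
To make the comparison in Table 2 easier to read, the counts can be converted into the fraction of validated deletions recovered by each tool. The short Python sketch below uses only the numbers reported in Table 2; the variable and function names are ours.

```python
# Counts taken from Table 2: validated deletions recovered by each tool.
VALIDATED_TOTAL = {"Mills": 79, "Conrad": 351, "Kidd": 58}
RECOVERED = {
    "BD": {"Mills": 13, "Conrad": 158, "Kidd": 49},
    "Delly": {"Mills": 13, "Conrad": 168, "Kidd": 51},
    "Pindel": {"Mills": 28, "Conrad": 143, "Kidd": 33},
    "HC": {"Mills": 4, "Conrad": 0, "Kidd": 0},
    "iSVP": {"Mills": 30, "Conrad": 166, "Kidd": 49},
}


def sensitivity(tool, dataset):
    """Fraction of the validated deletions in `dataset` recovered by `tool`."""
    return RECOVERED[tool][dataset] / VALIDATED_TOTAL[dataset]


for tool in RECOVERED:
    row = ", ".join(f"{d} {sensitivity(tool, d):.0%}" for d in VALIDATED_TOTAL)
    print(f"{tool}: {row}")
```

For iSVP this gives roughly 38%, 47%, and 84% for the Mills, Conrad, and Kidd sets, respectively, under the strict criterion of a quality score above 0.9.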

Computational resources 

In our analysis, we used Red Hat Enterprise Linux Ser- 
ver release 6.2 operating system with Intel Xeon CPU 
E5-2670 processors running at 2.60 GHz. For each SV 
calling tool, the required computational resources for 
SV detection from the simulation data with an average 
depth of 30x and real data (NA12878 whole genome 
sequence data with average depth of 45 x) are summar- 
ized in Table 1. As mentioned in the Background sec- 
tion, the largest amount of CPU time was required for 
AS, followed by SR and RP, using simulation data. By 




Table 2 Validation of deletion calls from NA12878 data.

Tool    Called (>50 bp)  Mills (n = 79)  Conrad (n = 351)  Kidd (n = 58)
BD      5,014            13              158               49
Delly   286,289          13              168               51
Pindel  7,265            28              143               33
HC      1,880            4               0                 0
iSVP    8,130            30              166               49

The sensitivity of deletion calling for each tool was confirmed by comparing
to multiple datasets from the literature (Mills [25], Conrad [23], and Kidd [24]).
The n in parentheses shows the number of total validated deletions for each
experiment. BD and HC stand for BreakDancer and Haplotype Caller,
respectively.



comparing the results of simulation and real data, we 
see that BD and Pindel required predictable amounts of 
CPU time and memory space based on the simulation 
data (i.e., nearly proportional to the coverage of read 
depth). For HC, we found that the CPU time was several 
times larger than expected. Delly required relatively lar- 
ger resources in terms of CPU time and maximum 
memory size for the real data (see Table 1). For iSVP, 
most of the computational resources are consumed
in the SV calling step. The CPU time and memory
space consumed by the subsequent filtering and merging
steps are less than 30 minutes and 2 Gb, respectively.

Discussion and conclusions 

We investigated several types of SV calling tools and 
evaluated their performance with a detailed simulation 
analysis. We found that there were significant differences
in performance according to the employed algorithms
and deletion size. Each tool had its strengths and
weaknesses, and there was no algorithm that consistently
outperformed others. HC, an AS approach, performed 
especially well for deletions in the size range 1-100 bp. 
Pindel, an SR approach, performed relatively better than 
other methods for deletions of 100-10,000 bp. BD and 
Delly, both of which employ the RP approach, were able to detect large deletions.
Importantly, regardless of the algorithm used,
high-coverage reads were consistently informative for 
detecting deletions. Based on the simulation results, we 
developed iSVP, a new pipeline to unify these methods 
with filtering and merging processes to comprehensively 
and reliably detect genomic SVs. Our approach suc- 
ceeded in achieving more than 90% precision and 90% 
recall for a broad range of deletion sizes. We showed 
that a relatively higher depth of coverage (more than 
20x) was required to gain good performance in SV 
detection from simulation experiments. This high-cover- 
age requirement may be one of the reasons why com- 
prehensive catalogs of SVs are still limited at the 
moment. 

By applying iSVP to human whole genome sequence 
data from a HapMap NA12878 sample, we detected 
numerous SVs that were biologically explainable, and 
some of them have been validated by other independent 
experiments. iSVP is broadly applicable to high-coverage 
whole genome sequencing data with reasonable compu- 
tational resources, which will enhance the genome-wide 
detection of SVs for the identification of disease-causing 
variants. However, the recall on real data
was lower than that expected from the computational
simulation. This problem may be related to the com- 
plexity of sequences around the SVs, which has not 
been sufficiently investigated yet. 

Our future work will include a study of the performance
of iSVP for types of SVs other than
deletions, such as insertions, duplications, and translocations,
which are more difficult to detect and validate.
Furthermore, developing a pipeline to genotype multiple 
samples simultaneously is also a challenging and pro- 
mising task. 

Competing interests 

The authors declare that they have no competing interests. 
Authors' contributions 

TM, NN, and MN conceived the study, TM, NN and KK designed the 
computational experiment, TM performed the analysis, and TM, NN, KK, and
MN interpreted the results. MT, AO, YS, and YYK collaborated on data 
collection and interpretation of the results. TM, NN, KK, YS, YYK, and MN 
wrote the manuscript. All the authors read and approved the final 
manuscript. 

Acknowledgements 

We would like to thank the anonymous referees for their constructive 
comments and suggestions, which improved the quality of the manuscript. 
This work was supported (in part) by the MEXT Tohoku Medical Megabank 
Project. 

Declarations 

The publication costs for this article were funded by the MEXT Tohoku 
Medical Megabank Project. 

This article has been published as part of BMC Systems Biology Volume 7
Supplement 6, 2013: Selected articles from the 24th International Conference
on Genome Informatics (GIW2013). The full contents of the supplement are
available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.

Published: 13 December 2013 
References 

1. Shlien A, Tabori U, Marshall CR, Pienkowska M, Feuk L, Novokmet A,
Nanda S, Druker H, Scherer SW, Malkin D: Excessive genomic DNA copy
number variation in the Li-Fraumeni cancer predisposition syndrome.
Proc Natl Acad Sci USA 2008, 105:11264-11269.

2. Porteous D: Genetic causality in schizophrenia and bipolar disorder: out 
with the old and in with the new. Curr Opin Genet Dev 2008, 18:229-234. 

3. Pollex RL, Hegele RA: Genomic copy number variation and its potential
role in lipoprotein and metabolic phenotypes. Curr Opin Lipidol 2007,
18:174-180.

4. Kumar D: Disorders of the genome architecture: a review. Genomic Med
2008, 2:69-76.

5. Ptacek T, Li X, Kelley JM, Edberg JC: Copy number variants in genetic 
susceptibility and severity of systemic lupus erythematosus. Cytogenet 
Genome Res 2008, 123:142-147. 

6. Wu YL, Yang Y, Chung EK, Zhou B, Kitzmiller KJ, Savelli SL, Nagaraja HN, 
Birmingham DJ, Tsao BP, Rovin BH, et al: Phenotypes, genotypes and 
disease susceptibility associated with gene copy number variations: 
complement C4 CNVs in European American healthy subjects and those 
with systemic lupus erythematosus. Cytogenet Genome Res 2008, 
123:131-141. 

7. Takezaki N, Nei M: Genomic drift and evolution of microsatellite DNAs in
human populations. Mol Biol Evol 2009, 26:1835-1840. 

8. Parra I, Windle B: High resolution visual mapping of stretched DNA by 
fluorescent hybridization. Nat Genet 1993, 5:17-21. 

9. Florijn RJ, Blonden LA, Vrolijk H, Wiegant J, Vaandrager JW, Baas F, den
Dunnen JT, Tanke HJ, van Ommen GJ, Raap AK: High-resolution DNA
Fiber-FISH for genomic DNA mapping and colour bar-coding of large
genes. Hum Mol Genet 1995, 4:831-836.

10. Greenberg F, Guzzetta V, Montes de Oca-Luna R, Magenis RE, Smith AC,
Richter SF, Kondo I, Dobyns WB, Patel PI, Lupski JR: Molecular analysis of
the Smith-Magenis syndrome: a possible contiguous-gene syndrome
associated with del(17)(p11.2). Am J Hum Genet 1991, 49:1207-1218.



11. Loots GG, Kneissel M, Keller H, Baptist M, Chang J, Collette NM,
Ovcharenko D, Plajzer-Frick I, Rubin EM: Genomic deletion of a long-range
bone enhancer misregulates sclerostin in Van Buchem disease. Genome
Res 2005, 15:928-935.

12. LaFramboise T: Single nucleotide polymorphism arrays: a decade of
biological, computational and technological advances. Nucleic Acids Res
2009, 37:4181-4193.

13. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL,
Chen C, Zhai Y, et al: High resolution analysis of DNA copy number
variation using comparative genomic hybridization to microarrays. Nat
Genet 1998, 20:207-211.

14. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H,
Shapero MH, Carson AR, Chen W, et al: Global variation in copy number
in the human genome. Nature 2006, 444:444-454.

15. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: 
PennCNV: an integrated hidden Markov model designed for high- 
resolution copy number variation detection in whole-genome SNP 
genotyping data. Genome Res 2007, 17:1665-1674. 

16. Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to 
discover, genotype, and characterize typical and atypical CNVs from 
family and population genome sequencing. Genome Res 2011, 21:974-984. 

17. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS,
McGrath SD, Wendl MC, Zhang Q, Locke DP, et al: BreakDancer: an
algorithm for high-resolution mapping of genomic structural variation.
Nat Methods 2009, 6:677-681.

18. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth
approach to detect break points of large deletions and medium sized
insertions from paired-end short reads. Bioinformatics 2009, 25:2865-2871.

19. Suzuki S, Yasuda T, Shiraishi Y, Miyano S, Nagasaki M: ClipCrop: a tool for
detecting structural variations with single-base resolution using soft-
clipping information. BMC Bioinformatics 2011, 12(Suppl 14):S7.

20. Li H, Durbin R: Fast and accurate short read alignment with Burrows-
Wheeler transform. Bioinformatics 2009, 25:1754-1760.

21. The 1000 Genomes Project Consortium: An integrated map of genetic 
variation from 1,092 human genomes. Nature 2012, 491:56-65. 

22. Wong K, Keane TM, Stalker J, Adams DJ: Enhanced structural variant and 
breakpoint detection using SVMerge by integration of multiple 
detection methods and local assembly. Genome Biol 2010, 11:R128. 

23. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J,
Andrews TD, Barnes C, Campbell P, et al: Origins and functional impact of 
copy number variation in the human genome. Nature 2010, 464:704-712. 

24. Kidd JM, Cooper GM, Donahue WF, Hayden MS, Sampas N, Graves T, 
Hansen N, Teague B, Alkan C, Antonacci F, et al: Mapping and sequencing 
of structural variation from eight human genomes. Nature 2008, 
453:56-64. 

25. Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE:
An initial map of insertion and deletion (INDEL) variation in the human
genome. Genome Res 2006, 16:1182-1190.

26. Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO: DELLY:
structural variant discovery by integrated paired-end and split-read
analysis. Bioinformatics 2012, 28:i333-i339.

27. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C,
Philippakis AA, del Angel G, Rivas MA, Hanna M, et al: A framework for
variation discovery and genotyping using next-generation DNA
sequencing data. Nat Genet 2011, 43:491-498.

28. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A,
Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome
Analysis Toolkit: a MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Res 2010, 20:1297-1303.

29. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A,
Yoon SC, Ye K, Cheetham RK, et al: Mapping copy number variation by
population-scale genome sequencing. Nature 2011, 470:59-65.

30. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics 2010, 26:841-842.



doi:10.1186/1752-0509-7-S6-S8

Cite this article as: Mimori et al: iSVP: an integrated structural variant
calling pipeline from high-throughput sequencing data. BMC Systems
Biology 2013 7(Suppl 6):S8.



